Import the dataset and libraries; check data types, the statistical summary, shape, null values, and incorrect imputations

In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# importing plotting libraries
import matplotlib.pyplot as plt
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

#importing seaborn for statistical plots
import seaborn as sns

# Later we will split the X and y dataframes into a training set and a test set,
# using sklearn's train_test_split, which shuffles the data randomly

from sklearn.model_selection import train_test_split

import numpy as np
from scipy import stats

# calculate accuracy measures and confusion matrix
from sklearn import metrics
In [4]:
df = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
df.head()
Out[4]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [6]:
df.shape
Out[6]:
(5000, 14)
In [9]:
#Basic Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
In [10]:
# Let's analyze the distribution of the various attributes
df.describe().transpose()
Out[10]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
In [13]:
df.isnull().sum()
Out[13]:
ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64
In [15]:
df.isnull().values.any()
Out[15]:
False

EDA: Data Distribution

In [11]:
df.nunique() # Number of unique values in a column
Out[11]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIP Code               467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal Loan            2
Securities Account       2
CD Account               2
Online                   2
CreditCard               2
dtype: int64
In [238]:
df[df['Mortgage']==0]['Mortgage'].count()
Out[238]:
3462
In [214]:
print(df.CreditCard.value_counts()) # number of customers without (0) and with (1) a credit card
0    3530
1    1470
Name: CreditCard, dtype: int64
In [208]:
print (df.Family.value_counts())
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64
In [210]:
print (df.Education.value_counts())
1    2096
3    1501
2    1403
Name: Education, dtype: int64
In [215]:
print (df.Online.value_counts())
1    2984
0    2016
Name: Online, dtype: int64
In [243]:
print (df['CD Account'].value_counts())
0    4698
1     302
Name: CD Account, dtype: int64
In [244]:
print (df['Personal Loan'].value_counts())
0    4520
1     480
Name: Personal Loan, dtype: int64
In [245]:
print (df['Securities Account'].value_counts())
0    4478
1     522
Name: Securities Account, dtype: int64
Univariate & Bivariate Analysis; Making the Data Ready for the Model
In [18]:
!pip install pandas_profiling
import pandas_profiling
df.profile_report()
(pip installation log trimmed)
Successfully installed confuse-1.3.0 htmlmin-0.1.12 imagehash-4.1.0 missingno-0.4.2 pandas-profiling-2.9.0 phik-0.10.0 tangled-up-in-unicode-0.0.6 visions-0.5.0



Out[18]:

In [17]:
# count of negative Experience values
df[df['Experience']<0]['Experience'].count()
Out[17]:
52
Observation: The Experience column contains negative values, which are invalid; they need to be corrected, e.g. replaced with the absolute value, the median, or the mean.
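A minimal sketch (on a toy frame, not the real dataset) of three common ways to repair the negative Experience values:

```python
import pandas as pd

# Toy stand-in for the Experience column, including invalid negatives
toy = pd.DataFrame({'Experience': [-3, -1, 5, 20, 30]})

# Option 1: take the absolute value (treats the sign as a data-entry error)
abs_fix = toy['Experience'].abs()

# Option 2: clip at zero
clip_fix = toy['Experience'].clip(lower=0)

# Option 3: replace negatives with the median of the valid, non-negative values
median_valid = toy.loc[toy['Experience'] >= 0, 'Experience'].median()
median_fix = toy['Experience'].mask(toy['Experience'] < 0, median_valid)
```

Which option is best depends on whether the negatives are sign errors or junk entries; the absolute-value fix preserves the age-experience relationship best if they are sign errors.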
In [19]:
sns.pairplot(df.iloc[:,1:])  
Out[19]:
<seaborn.axisgrid.PairGrid at 0x231fd937370>
Strong correlation between experience & age
In [12]:
df.dtypes
Out[12]:
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object
In [28]:
plt.figure(figsize = (15,7))
plt.title('Correlation of Attributes', y=1.025, size=15)
sns.heatmap(df.corr(), cmap='plasma', annot=True, fmt='0.2f')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318ca551f0>

Observations:

- No missing values.
- The minimum value of Experience is -3; experience cannot be negative, so it should be a non-negative integer.
- The data is a mix of numeric (Age, ID, Income, Mortgage, etc.), categorical (Family, Education), and Boolean (CD Account, CreditCard, Online, etc.) variables.
- Strong correlation between Age and Experience (coefficient 0.99).
- Personal Loan correlates positively with Income, average credit-card spending (CCAvg), and Mortgage.
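The age-experience link can be checked directly with pandas' pairwise `corr`. A toy sketch with synthetic numbers (not the real frame) illustrating why the coefficient comes out near 1 when experience simply tracks age minus a fixed offset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Experience = Age - 22 plus small integer noise
rng = np.random.default_rng(0)
age = rng.integers(23, 68, size=500)
experience = age - 22 + rng.integers(-2, 3, size=500)

toy = pd.DataFrame({'Age': age, 'Experience': experience})
r = toy['Age'].corr(toy['Experience'])   # pairwise Pearson correlation
```

Because the noise is tiny relative to the spread of Age, the Pearson r is very close to 1, matching the 0.99 seen in the heatmap.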

In [29]:
for i in ['Age', 'Experience', 'Income', 'CCAvg', 'Education','Mortgage', 'Online', 'CreditCard', 'Family','Personal Loan', 'Securities Account', 'CD Account']:
    sns.distplot(df[i])
    plt.show()
In [32]:
df['Family'].value_counts(normalize=True)
Out[32]:
1    0.2944
2    0.2592
4    0.2444
3    0.2020
Name: Family, dtype: float64
In [36]:
df['Education'].value_counts(normalize=True)
Out[36]:
1    0.4192
3    0.3002
2    0.2806
Name: Education, dtype: float64
In [38]:
df['Securities Account'].value_counts(normalize=True)
Out[38]:
0    0.8956
1    0.1044
Name: Securities Account, dtype: float64
In [39]:
df['CD Account'].value_counts(normalize=True)
Out[39]:
0    0.9396
1    0.0604
Name: CD Account, dtype: float64
In [41]:
df['Online'].value_counts(normalize=True)
Out[41]:
1    0.5968
0    0.4032
Name: Online, dtype: float64
In [42]:
df['CreditCard'].value_counts(normalize=True)
Out[42]:
0    0.706
1    0.294
Name: CreditCard, dtype: float64
In [43]:
df['Personal Loan'].value_counts(normalize=True)
Out[43]:
0    0.904
1    0.096
Name: Personal Loan, dtype: float64
In [60]:
data_df.boxplot(return_type='axes', figsize=(20,5))
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318c226d60>
Observations:

- 90% of customers did not accept the personal loan.
- ~70% do not have a credit card.
- ~40% do not use the online banking system.
- ~29% have a family size of 1.
- The Mortgage data has many outliers.
- ID, Experience, and ZIP Code can be dropped, as they are not really significant in this case.

In [58]:
data_df=df.drop(['ID','Experience', 'ZIP Code'], axis=1)
data_df.head()
Out[58]:
Age Income Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 25 49 4 1.6 1 0 0 1 0 0 0
1 45 34 3 1.5 1 0 0 1 0 0 0
2 39 11 1 1.0 1 0 0 0 0 0 0
3 35 100 1 2.7 2 0 0 0 0 0 0
4 35 45 4 1.0 2 0 0 0 0 0 1
In [67]:
pd.crosstab(data_df['Family'],data_df['Personal Loan'],normalize='index')
Out[67]:
Personal Loan 0 1
Family
1 0.927310 0.072690
2 0.918210 0.081790
3 0.868317 0.131683
4 0.890344 0.109656
In [68]:
pd.crosstab(data_df['Education'],data_df['Personal Loan'],normalize='index')
Out[68]:
Personal Loan 0 1
Education
1 0.955630 0.044370
2 0.870278 0.129722
3 0.863424 0.136576
In [69]:
pd.crosstab(data_df['Securities Account'],data_df['Personal Loan'],normalize='index')
Out[69]:
Personal Loan 0 1
Securities Account
0 0.906208 0.093792
1 0.885057 0.114943
In [70]:
pd.crosstab(data_df['CD Account'],data_df['Personal Loan'],normalize='index')
Out[70]:
Personal Loan 0 1
CD Account
0 0.927629 0.072371
1 0.536424 0.463576
In [71]:
pd.crosstab(data_df['Online'],data_df['Personal Loan'],normalize='index')
Out[71]:
Personal Loan 0 1
Online
0 0.90625 0.09375
1 0.90248 0.09752
In [72]:
pd.crosstab(data_df['CreditCard'],data_df['Personal Loan'],normalize='index')
Out[72]:
Personal Loan 0 1
CreditCard
0 0.904533 0.095467
1 0.902721 0.097279
In [95]:
PLoan_counts=pd.DataFrame(df['Personal Loan'].value_counts()).reset_index()
PLoan_counts.columns =['Category','Personal Loan']
PLoan_counts
Out[95]:
Category Personal Loan
0 0 4520
1 1 480
In [102]:
fig1, ax1 = plt.subplots()
explode = (0, 0.25)
ax1.pie(PLoan_counts["Personal Loan"], explode=explode, labels=PLoan_counts["Category"], autopct='%1.2f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  
plt.title("% Personal Loan")
plt.show()
In [104]:
sns.catplot(x='Family', y='Income', hue='Personal Loan', data = df, kind='swarm')
Out[104]:
<seaborn.axisgrid.FacetGrid at 0x2318ae05370>
In [105]:
sns.boxplot(x='Education', y='Income', hue='Personal Loan', data = df)
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318dff8e20>
In [106]:
sns.boxplot(x="Education", y='Mortgage', hue="Personal Loan", data=df)
Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318dd620d0>
In [107]:
sns.countplot(x='Family',data=df,hue='Personal Loan')
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318e6b6970>
In [108]:
df.groupby('Personal Loan')['CCAvg'].mean().plot(kind='bar')
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318e037d30>
In [109]:
df.groupby('Personal Loan')['Income'].mean().plot(kind='bar')
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318debc460>
Observations:

Customers with high incomes (roughly 80-100K and above) accepted the personal loan in this case. Customers with high average credit-card spending (CCAvg) accepted the personal loan. CD account holders accept the personal loan more often.

Data Split

In [111]:
## Define X and Y variables

X = data_df.drop('Personal Loan', axis=1)
Y = data_df['Personal Loan'].astype('category')     

# cast the target to the 'category' dtype so it is treated as a class label
In [112]:
# Convert categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True)

Modelling & parameters

In [179]:
##Split into training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y,test_size=0.30,random_state=7)
In [180]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=4294967295,fit_intercept=False)
logreg.fit(X_train, y_train)                    # fit the model on train data
Out[180]:
LogisticRegression(fit_intercept=False, random_state=4294967295)
In [181]:
y_predict = logreg.predict(X_test)             # Predict the target variable on the test data
In [182]:
# Observe the predicted and observed classes in a dataframe.

z = X_test.copy()
z['Observed Personal Loan'] = y_test
z['Predicted Personal Loan'] = y_predict
z.head()
Out[182]:
Age Income Family CCAvg Education Mortgage Securities Account CD Account Online CreditCard Observed Personal Loan Predicted Personal Loan
3406 42 34 3 2.0 3 0 0 0 0 1 0 0
757 52 81 3 1.8 2 0 1 0 0 0 0 0
3624 58 70 1 1.4 3 0 0 0 0 0 0 0
4544 28 80 3 2.5 1 0 0 0 1 0 0 0
3235 60 39 2 1.6 3 0 0 0 1 0 0 0
In [183]:
## function to get confusion matrix in a proper format
def draw_cm( actual, predicted ):
    cm = confusion_matrix( actual, predicted)
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1] )
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
In [184]:
print("Training accuracy", logreg.score(X_train, y_train))
print()
print("Testing accuracy",logreg.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_cm(y_test, y_predict)   # draw_cm plots the matrix and returns None, so no print() needed
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.9134285714285715

Testing accuracy 0.9153333333333333

Confusion Matrix

Recall: 0.36231884057971014

Precision: 0.5617977528089888

F1 Score: 0.44052863436123346

Roc Auc Score: 0.6668422396731151
In [185]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
In [186]:
logreg_model = LogisticRegression()            # second model: default settings (intercept fitted)
logreg_model.fit(X_train, y_train)
y_predict = logreg_model.predict(X_test)
print(classification_report(y_test, y_predict))
print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1362
           1       0.78      0.61      0.68       138

    accuracy                           0.95      1500
   macro avg       0.87      0.80      0.83      1500
weighted avg       0.94      0.95      0.95      1500

0.948
[[1338   24]
 [  54   84]]
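As a sanity check, the class-1 metrics in the report above can be derived by hand from the printed confusion matrix [[1338, 24], [54, 84]]:

```python
# Rows of the confusion matrix are observed classes (0, 1),
# columns are predicted classes (0, 1).
tn, fp = 1338, 24
fn, tp = 54, 84

precision = tp / (tp + fp)                  # 84 / 108
recall = tp / (tp + fn)                     # 84 / 138
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)  # 1422 / 1500
```

Rounding these reproduces the 0.78 precision, 0.61 recall, 0.68 F1, and 0.948 accuracy printed above.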
In [187]:
# !pip install yellowbrick

# Additional

#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
In [188]:
## Feature Importance or Coefficients 
fi = pd.DataFrame()
fi['Col'] = X_train.columns
fi['Coeff'] = np.round(abs(logreg.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
Out[188]:
Col Coeff
7 CD Account 3.13
8 Online 1.24
9 CreditCard 1.09
4 Education 0.46
2 Family 0.20
6 Securities Account 0.19
0 Age 0.09
1 Income 0.02
3 CCAvg 0.02
5 Mortgage 0.00

Observation

- Logistic regression is used for the binary prediction, with Personal Loan as the dependent variable; the data is split 70/30 into training and test sets.
- Training and testing accuracy are both about 91%, but the vast majority of customers are non-buyers, so plain accuracy is a debatable measure for this imbalanced problem.
- The recall of 61% in the classification report means the model does reasonably well at predicting positives, but the ~67% AUC for the no-intercept model is a warning sign.
- Filtering outliers (e.g. in Mortgage and Income) might improve accuracy, recall, and AUC, but could amount to overdoing it, since we already know higher-income customers buy the personal loan.
- Scaling the attributes, or trying other models such as a decision tree, would be sensible next steps to confirm the trends.
- For marketing, the personal-loan product needs to be made more attractive to low- and medium-income customers, who are currently not buying.
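A minimal sketch of the suggested next step, on synthetic data rather than the bank dataset: scale the features and re-weight the classes so the minority loan-takers count for more during fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ~90/10 imbalanced loan data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=7, stratify=y)

# StandardScaler handles the attribute scaling;
# class_weight='balanced' counters the class imbalance
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight='balanced', max_iter=1000))
model.fit(X_tr, y_tr)
rec = recall_score(y_te, model.predict(X_te))
```

On imbalanced data, `class_weight='balanced'` typically trades some precision for noticeably better recall on the minority class, which is the metric that matters for finding loan-takers.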

In [ ]:
 
In [ ]: